VAST 2010 Challenge: the giCentre, City University London

Jo Wood, the giCentre, City University London, jwo@soi.city.ac.uk [PRIMARY contact]
Jason Dykes, the giCentre, City University London, jad7@soi.city.ac.uk
Aidan Slingsby, the giCentre, City University London, a.slingsby@soi.city.ac.uk

Tool(s):

PandemView was developed using Processing, a set of Java libraries for rapid development of graphical and visualization sketches. Development was accelerated by using Processing Utilities, a set of reusable tools developed by the giCentre for constructing interactive visualization applications. We also used MySQL to store the normalised datasets and geographic city locations. Creating the PandemView applications shown here took approximately 16 hours of development time.

Video:

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

Cleaning the data

Descriptions of symptoms ("admission syndrome") were inconsistent and error prone. The first stage was therefore to standardize these descriptions as much as possible. Candidate terms for standardization were identified by creating alphabetical tag clouds of the unprocessed syndromes and examining repeated and unexpected terms. Since words were sized by frequency, effort could be directed towards standardizing the more important terms. This suggested three forms of correction. Firstly, punctuation was removed and non-alphanumeric symbols replaced (e.g. "&" with "and"). Secondly, misspellings, abbreviations and synonyms were replaced with their stem equivalents (e.g. "abdominal", "adb" and "admnal" were all replaced with "abdomen"). Thirdly, syndrome descriptions that consisted of the same phrase repeated twice (presumably due to data input error) were spotted and corrected. Cleaning in this way reduced the number of distinct syndrome descriptions by 10-20% and increased the reliability of term frequency analysis. Identifying the terms to be changed took about 45 minutes, but the list of replacement rules was stored in a separate file so that it could be reused and amended if further records were to be processed. Cleaned data were stored in a MySQL database.
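The three corrections described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Processing code: the function name and the rule table are hypothetical, and only the three "abdomen" replacements come from the text.

```python
import re

# Hypothetical rule table; the authors kept their rules in a separate file.
# Only these three example replacements are taken from the text.
SYNONYMS = {"abdominal": "abdomen", "adb": "abdomen", "admnal": "abdomen"}

def clean_syndrome(text, synonyms=SYNONYMS):
    """Standardize one free-text admission syndrome."""
    # 1. Replace "&" with "and", then drop remaining punctuation.
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # 2. Map misspellings, abbreviations and synonyms to a stem term.
    words = [synonyms.get(w, w) for w in text.split()]
    # 3. Collapse a description that is the same phrase repeated twice.
    half = len(words) // 2
    if half > 0 and len(words) % 2 == 0 and words[:half] == words[half:]:
        words = words[:half]
    return " ".join(words)
```

Keeping the rules in a plain dictionary (or a file loaded into one) mirrors the idea of a reusable rule list that can be amended when new records arrive.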

Identifying Drafa Fever Symptoms

Symptoms most likely to be associated with Drafa fever were identified by examining symptom frequencies as an alphabetical tag cloud for each country (Figure 1, top). While this identified the most common symptoms, it did not demonstrate that they were associated with Drafa fever. An assumption was made that fatal admissions were more likely to be Drafa cases than non-fatal admissions (this assumption is tested below). This allowed expected symptom frequencies to be compared with observed fatal symptom frequencies (Figure 1, lower). The larger red terms provided the candidate Drafa symptoms, being both common and more frequent than expected in fatal admissions. The top 8 Drafa diagnostic symptoms were therefore identified as abdomen, back, bleeding, death, diarrhea, fever, pain and vomiting. A 'Drafa score' was stored for each admission, ranging from 0 (none of the diagnostic symptoms) to 1 (all of the diagnostic symptoms).
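A minimal Python sketch of the two calculations described above. The eight diagnostic terms and the 0 to 1 range are from the text; the function names are hypothetical, and both the equal weighting of terms in the score and the ratio used to compare observed with expected frequencies are assumptions, since the text does not give the exact formulas.

```python
from collections import Counter

DIAGNOSTIC = frozenset(
    ["abdomen", "back", "bleeding", "death", "diarrhea", "fever", "pain", "vomiting"]
)

def drafa_score(cleaned_syndrome):
    """Fraction of the 8 diagnostic terms present (assumes equal weights)."""
    terms = set(cleaned_syndrome.split())
    return len(terms & DIAGNOSTIC) / len(DIAGNOSTIC)

def fatal_overrepresentation(all_syndromes, fatal_syndromes):
    """Observed share of each term in fatal admissions divided by its
    share in all admissions (assumed measure; values > 1 mean the term is
    more frequent than expected). Fatal records must be a subset of all."""
    all_counts = Counter(w for s in all_syndromes for w in s.split())
    fatal_counts = Counter(w for s in fatal_syndromes for w in s.split())
    all_total = sum(all_counts.values())
    fatal_total = sum(fatal_counts.values())
    return {
        term: (fatal_counts[term] / fatal_total) / (all_counts[term] / all_total)
        for term in fatal_counts
    }
```

For example, `drafa_score("abdomen pain and fever")` gives 0.375, because three of the eight diagnostic terms are present.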

Figure 1: Tag clouds of admission symptoms, Aden, Yemen. The upper cloud shows all symptoms; the lower cloud shows frequency of symptoms of patients who died. Terms are coloured according to expected frequency for all admissions.

Spread of the disease

PandemView (Figure 2) was used to explore spatio-temporal patterns of the disease. We could filter the data by time (left-right arrows) and by place (up-down arrows) and choose different hospital admission summary measures. All graphs show change over time from 16th April to 30th June 2009 along the horizontal axis. Figure 2 shows numbers of fatal admissions over time highlighting Tabriz, Iran on the 19th May 2009. The approximately normal distribution of fatalities over time for 9 of the 11 cities demonstrates that fatal cases of Drafa Fever dominate hospital deaths over the period and supports the assumption that examining fatal admissions provides a good basis for establishing Drafa fever symptoms.

Figure 2: The PandemView application showing four views of the data. The user interactively selects the day to view, the place to view and the hospital admission summary to view. Top left shows the map view for a particular day, where the size of each circle is proportional to the selected summary measure. Bottom left shows how a selected summary view changes over time (from left to right) for a given place. Gender (pink/blue) and age (line graph) profiles are also shown. Symptoms for the given day and place are shown in the centre column, ordered by frequency or TF-IDF (see MC2.2 below). The right-hand column shows the time graphs for the selected summary for all places.

To see whether part of the population was particularly vulnerable to fatal contraction of Drafa fever, the number of deaths was broken down by gender (Figure 2 bottom left, pink and blue stacked bars). A 50% male/female line was superimposed on the graph, allowing any over- or under-representation of male or female admissions to be spotted. Examining each country at the peak of the disease outbreak revealed no such gender relationship. The mean age of all patients on each day for each place was calculated and superimposed on the graph (Figure 2 bottom left, age scale on right-hand axis). There appeared to be no obvious age-related difference between those patients admitted with the disease and those without it (mean age in the mid 40s throughout the period).

One of the summary measures that can be shown in PandemView is the mean number of days between hospital admission and death for each city on each day. An example of Aleppo, Syria is shown in Figure 3 (date on the horizontal axis, average number of days to death on the vertical axis). The remarkably consistent peak at 8 days during the Drafa outbreak period is obvious from this graph. There appears to be no gender-related pattern.
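A minimal Python sketch of this summary measure for a single city. The record layout and function name are hypothetical, and grouping by admission date (rather than by date of death) is an assumption; the text does not say which was used.

```python
from collections import defaultdict
from datetime import date

def mean_days_to_death(fatal_records):
    """fatal_records: iterable of (admission_date, death_date) pairs for
    one city. Returns {admission_date: mean days from admission to death}."""
    spans = defaultdict(list)
    for admitted, died in fatal_records:
        spans[admitted].append((died - admitted).days)
    return {day: sum(values) / len(values) for day, values in spans.items()}
```

Plotting the returned values against date would give a graph of the kind shown in Figure 3.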

Figure 3: Number of days between hospital admission and death for all fatal cases in Aleppo, Syria. Female fatalities shown in pink, male fatalities shown in blue.

To account for uncertainty in diagnosing the disease from the reported symptoms, one of the measures displayable in PandemView is the number of people with 0, 1, 2...8 of the diagnostic symptoms. This is represented as a cumulative bar chart showing likelihood of Drafa diagnosis over time (Figure 4). The darker the colour, the more certain we can be that a patient has Drafa fever. Even counting patients who present only one of the diagnostic symptoms, admissions peak at about 150,000 per day in Karachi against a background admission rate of about 50,000, which suggests that the disease peaks at about 100,000 cases per day there. Set against a fatal admission rate of about 8000, this gives a mortality rate for those contracting the disease of about 8%. This was supported by similar patterns in other infected cities.
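The arithmetic behind the 8% estimate can be written out as a short Python sketch. The three input figures are the approximate Karachi values quoted above; the function name is hypothetical.

```python
def estimated_mortality(peak_with_symptoms, background_admissions, peak_fatal):
    """Fatal admissions divided by admissions in excess of the background."""
    drafa_cases = peak_with_symptoms - background_admissions
    return peak_fatal / drafa_cases

# 150,000 - 50,000 = 100,000 suspected cases; 8,000 / 100,000 = 0.08
rate = estimated_mortality(150_000, 50_000, 8_000)
```

Because the inputs are read from a graph and rounded, the result should be treated as an order-of-magnitude estimate rather than a precise rate.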

Figure 4: numbers of patients with diagnostic Drafa symptoms for Karachi, Pakistan. Date represented on the horizontal axis, numbers of patients on the vertical axis.

MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

PandemView provides two ways of comparing Drafa outbreaks across cities. The map view (top left in Figure 2) allows spatial patterns to be observed (see video for examples). Given the relative sparsity of geographic variation in the dataset, it was helpful to graph outbreak summaries over time for each of the cities simultaneously. By aligning them vertically sharing the same horizontal date axis, these graphs provide a direct temporal comparison across cities (Figure 5).

Figure 5: Numbers of fatal admissions over time for all cities. The left-hand column scales the bars globally (i.e. with respect to the peak fatalities in Karachi). The right-hand column scales each city's fatalities locally (i.e. with respect to the city's own peak fatalities). In this example the vertical timeline is centred on Nairobi's peak fatalities on 14th May.

Column 1 of Figure 5 shows clearly that the largest number of patients are admitted in Karachi, Aleppo and Nairobi. They also have the correspondingly largest numbers of deaths and suspected Drafa cases. Examination of the graphs and their axes provided the precise numbers involved.
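The global and local bar scalings used in the two columns of Figure 5 can be sketched in Python as below. The function name and data layout are hypothetical and this is not the PandemView source.

```python
def scale_bars(counts_by_city, local=False):
    """counts_by_city: {city: [daily counts]}. Returns bar heights in [0, 1].

    local=False scales every city by the largest count in any city;
    local=True scales each city by its own peak."""
    global_peak = max(max(counts) for counts in counts_by_city.values())
    scaled = {}
    for city, counts in counts_by_city.items():
        peak = max(counts) if local else global_peak
        scaled[city] = [c / peak if peak else 0.0 for c in counts]
    return scaled
```

Global scaling makes absolute differences between cities visible, while local scaling makes the shape and timing of each city's curve comparable even when its numbers are small.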

Timing of the outbreaks

Moving the time slider allowed a direct comparison of the onset, peak and decay of fatalities over time (Figure 5). Because the very start and end of the disease fatalities involve relatively low numbers, the most diagnostic measure was found to be the peak of fatalities. Looking at non-fatal diagnostic symptoms gives a less clear picture than the fatal cases since many of these symptoms are also present in other unrelated conditions.

Of the countries for which we have data, the disease would appear to have originated in Africa, moving north to the Middle East and Pakistan before appearing in northern South America. This transmission is extremely rapid, taking less than a week to spread globally.

Aden appeared to take the longest to recover despite being one of the earlier cities to become infected. Given that recovery from fatal cases is likely to be due to distribution of antivirals and suitable isolation, we can hypothesise that such facilities were less readily available than in other cities. Barranquilla and Jeddah appeared to have the most rapid recoveries, possibly due to treatment facilities and mobility of the local population.

Disease Variants

While 9 cities show evidence of a major outbreak, some show evidence of two distributions of disease transmission. Examining the distribution of fatalities over time for Tabriz, a clear bimodal distribution with a first peak around the 4th May and the main peak around the 18th May can be seen (Figure 6).

Figure 6: Fatal admissions over time for Tabriz showing bimodal peaks in fatalities.

This suggested two disease distributions during an overlapping time period. One explanatory hypothesis was that each of the two peaks represents a different form of the disease with different symptoms. The second hypothesis was that treatment of the first form of the disease was successful, but was initially not successful with the second major outbreak.

To examine the first hypothesis, the most common symptoms associated with each place and day were considered, in particular whether the symptoms for the 4th May peak were different from those for the 18th May peak. The symptoms column of PandemView was used to explore this (Figure 7).

Figure 7: Admission symptoms for Tabriz. Columns 1 and 2: 4th and 18th May outbreaks, ordered by symptom frequency; Columns 3 and 4: 4th and 18th May outbreaks, ordered by TF-IDF. The length of each grey bar represents the frequency of the term (also shown as absolute numbers and % of symptom types).

There appeared to be little significant difference between the most common symptoms in the two outbreaks (columns 1 and 2 of Figure 7). There was slightly more diversity in symptom type (grey bars decay less rapidly in column 1 than in column 2). This may simply be due to smaller numbers of admissions in the earlier outbreak. In order to see if there were any distinctive symptoms associated with either outbreak, the TF-IDF score for each symptom was calculated. More usually associated with textual analysis of document corpora, TF-IDF measures the uniqueness of any given term and is shown in columns 3 and 4 of Figure 7. The grey bars still show term frequency so that distinctive, but rare symptoms can be discounted. No significant difference was observed between the two outbreaks.
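A minimal Python sketch of a TF-IDF calculation of the kind described above. The text does not state which TF-IDF variant was used, so the common form tf × log(N / df) is assumed here, with each "document" taken to be the list of symptom terms for one place and day. The function name and data layout are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: {key: list of symptom terms}, e.g. key = (city, day).
    Returns {key: {term: score}} using tf * log(N / df) (assumed variant)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))
    scores = {}
    for key, terms in documents.items():
        counts = Counter(terms)
        total = len(terms)
        scores[key] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

Under this variant a term that appears in every place and day scores zero however frequent it is, which is why the grey term-frequency bars remain useful alongside the TF-IDF ordering.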

The bimodal distribution could be observed in other cities, namely Jeddah, Beirut, Barcelona and Barranquilla (Figure 5). The first peak did not decay as rapidly in these cities as it did in Tabriz, suggesting that treatment policy was not as effective in these locations. The slight positive skew to the distributions in Karachi and Nairobi (Figure 5) suggests that they too may have been affected by this earlier, more infectious strain, but that no significant treatment was available for it.

Anomalies

The most obvious anomalies are the cities of Mersin and Nonthaburi that appear not to be infected with the fatal strain of the disease. This is evident from the relatively low number of deaths, showing an approximately random rectangular distribution over time rather than a normal distribution. Hospital mortality rates remain at a relatively constant 0.1% over the period. There are a couple of days where no apparent admissions took place suggesting there may be some uncertainty in the record keeping, especially the dates of admission.

Order	City	Disease start	Disease peak	Full recovery	Affected period (days)
1	Nairobi	24th April	14th May	17th June	54
2	Aleppo	25th April	15th May	18th June	54
3	Aden	24th April	16th May	22nd June	59
3	Beirut	26th April	16th May	15th June	50
5	Karachi	26th April	17th May	19th June	54
6	Jeddah	27th April	18th May	16th June	50
6	Tabriz	28th April	18th May	19th June	52
8	Barcelona	28th April	19th May	18th June	51
9	Barranquilla	30th April	20th May	18th June	49
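The "Affected period" column can be reproduced from the start and recovery dates with a short Python sketch. The dates are copied from the table above and the year 2009 is taken from the date range given for the graphs; the variable and function names are hypothetical.

```python
from datetime import date

# (disease start, full recovery) per city, from the table; all dates in 2009.
PERIODS = {
    "Nairobi": (date(2009, 4, 24), date(2009, 6, 17)),
    "Aleppo": (date(2009, 4, 25), date(2009, 6, 18)),
    "Aden": (date(2009, 4, 24), date(2009, 6, 22)),
    "Beirut": (date(2009, 4, 26), date(2009, 6, 15)),
    "Karachi": (date(2009, 4, 26), date(2009, 6, 19)),
    "Jeddah": (date(2009, 4, 27), date(2009, 6, 16)),
    "Tabriz": (date(2009, 4, 28), date(2009, 6, 19)),
    "Barcelona": (date(2009, 4, 28), date(2009, 6, 18)),
    "Barranquilla": (date(2009, 4, 30), date(2009, 6, 18)),
}

def affected_days(start, recovery):
    """Number of days from disease start to full recovery."""
    return (recovery - start).days

affected = {city: affected_days(*span) for city, span in PERIODS.items()}
longest = max(affected, key=affected.get)
```

Here `longest` is the city with the longest affected period, which for these dates is Aden at 59 days.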

giCentre, City University London - PandemView

VAST 2010 Challenge
Hospitalization Records - Characterization of Pandemic Spread